Vector Store, full separation codec / vectorstore #113

hemidactylus · 2025-02-13T16:49:14Z

Clean separation of codecs/vectorstore layers

This PR introduces a better separation of knowledge on what pertains to codecs and what constitutes the underlying logic of the Vector Store. As such,

fixes #106

Meanwhile, it provides a couple of useful querying tools that feel like they belong to the vstore+codec layer (though another such tool, the "run_query" method, possibly the most important of these, is postponed to a follow-up PR).

Additionally, this restructuring of the code also makes a step toward a possible extension to cover API Tables without duplicating logic.

More in detail:

creation of "id queries" in the codec consistently (away from vectorstore)
moved id- and $vector-related parts of codec into the base class (not expected to vary under the data api)
codecs expose their "default indexing policy", vectorstore uses that knowledge in and around its constructor ( --> for coll. creation in particular)
codecs expose their "abstract metadata key to actual dot-notation field identifier" for building include/exclude policies specified by metadata fields
vectorstore does not say "_id" literally anymore (all is in codec; uses get_id)

A note on the name chosen for the get_id codec method. There is the unfortunate fact that LangChain calls "documents" its internal format, and Astra DB calls "document" what it stores. So ... codecs that lie between these two sometimes have a hard time with naming. Keeping it simple (get_id) should work in this case because only one of the two ends of the codec (the Astra side) can have a variable schema: only on that side, then, should the need arise for a function abstracting the reading of an ID. (Which btw is more of a formality, as the "_id" field is one of those that will hardly ever change in Astra DB!)
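For readers skimming the list above, here is a rough sketch of the shape such a codec interface could take. It is not the code in this PR: the class names, the property and method names other than get_id, and the concrete indexing policies are all assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class SketchCodec(ABC):
    """Hypothetical stand-in for a vector-store document codec."""

    def get_id(self, astra_document: dict[str, Any]) -> str:
        # "_id" does not depend on the encoding scheme, so the base class owns it.
        return astra_document["_id"]

    @property
    @abstractmethod
    def default_indexing_policy(self) -> dict[str, list[str]]:
        """Indexing policy the store would use when creating a collection."""

    @abstractmethod
    def metadata_key_to_field(self, md_key: str) -> str:
        """Map an abstract metadata key to its dot-notation field identifier."""


class NestedSketchCodec(SketchCodec):
    """Metadata stored under a "metadata" sub-document (assumed layout)."""

    @property
    def default_indexing_policy(self) -> dict[str, list[str]]:
        return {"allow": ["metadata"]}

    def metadata_key_to_field(self, md_key: str) -> str:
        return f"metadata.{md_key}"


class FlatSketchCodec(SketchCodec):
    """Metadata stored as top-level fields (assumed layout)."""

    def __init__(self, content_field: str) -> None:
        self.content_field = content_field

    @property
    def default_indexing_policy(self) -> dict[str, list[str]]:
        # Index everything except the (possibly long) text content.
        return {"deny": [self.content_field]}

    def metadata_key_to_field(self, md_key: str) -> str:
        return md_key
```

With this split, the vector store only ever asks the codec for the policy and the field names, and never spells out "_id" or "metadata." itself.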


bjchambers · 2025-02-14T16:06:46Z

libs/astradb/langchain_astradb/utils/vector_store_codecs.py

        """
+        return _default_encode_id(filter_id)
+
+    def encode_ids(self, filter_ids: list[str]) -> dict[str, Any]:


Two potential issues with this being a separate method:

For a user that has a query for "these IDs AND this predicate" they're going to need to write the _id filter themselves unless they have access to the codec.

We'd also need the rewriting that happens to special case the _id filter and not rewrite it to metadata._id.

I wonder if we should just adopt $id or _id as the standard field for the id. Then I don't know that we'd need encode_ids: we could just make the rewriter do the right thing in that case, and the user can provide { "$id": <id> } or { "$id": { "$in": [<ids>] } } depending on their needs, at arbitrary places within the query.
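To make the suggestion concrete, a rewriter along these lines might look roughly like the sketch below. The function name, the "$id" reserved key, and the "metadata." prefix default are assumptions for illustration, not the codec's actual API.

```python
from typing import Any

ID_KEY = "$id"  # hypothetical reserved key standing for the document id


def rewrite_filter(node: Any, prefix: str = "metadata.") -> Any:
    """Recursively turn a 'standard' filter into a stored-field filter."""
    if isinstance(node, list):
        # Operands of $and / $or: rewrite each one.
        return [rewrite_filter(item, prefix) for item in node]
    if not isinstance(node, dict):
        return node
    rewritten: dict[str, Any] = {}
    for key, value in node.items():
        if key == ID_KEY:
            # The id condition maps to "_id" and is never prefixed.
            rewritten["_id"] = value
        elif key.startswith("$"):
            # Logical operator: keep the key, rewrite what is inside.
            rewritten[key] = rewrite_filter(value, prefix)
        else:
            # Plain metadata key: prefix it, leave its condition untouched.
            rewritten[f"{prefix}{key}"] = value
    return rewritten
```

A flat layout would pass prefix="" so that metadata keys stay at the top level while "$id" still maps to "_id".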

bjchambers · 2025-02-14T16:08:42Z

libs/astradb/langchain_astradb/vectorstores.py

+                    doc_id = self.document_codec.get_id(document)
                    return await _async_collection.replace_one(
-                        {"_id": document["_id"]},
+                        self.document_codec.encode_id(doc_id),


The idea of allowing the codec to just be a filter rewriter would actually be pretty useful here as well. It would allow these to just be "do a query with codec.rewrite_filters({"$id": doc_id})" (I don't remember the name it currently has), which makes it clearer what is being done. Then, instead of having a bunch of methods on the codec that are used for creating queries, there is just the one that converts a "standard" query (basically, one on the Document) into an appropriately encoded query.

hemidactylus · 2025-02-17T15:17:24Z

@bjchambers Thank you for taking the time to review this. You make good points.
And you are right, a refinement of this separation is in order.

I have added the "do not merge" label while I go over it again. Here are the two points I take from your remarks.

I would still keep the codec unaware of logic such as "these filters are too many, let's split" (this does not belong there),
but I agree the codec should expose a more general method, which just translates "any query".

More on the second point: "any query", in LC world -- where Document objects have just: (1) id, (2) a vector, (3) metadata kv pairs and (4) an unindexed text -- means that the most general thing to query is:

zero, one or more IDs (if more: implied OR)
a search vector with its k for ANN
metadata conditions. These are: k: [v1, v2...] ==> implied OR, but also {k1:v1, k2:v2...} where it can be AND/OR equally. We could assume AND between different keys (do we worry about loss of generality on this?) or we can keep a Data-API-like syntax here.

What I'm trying to get to is, since the codec's "query translator" makes an abstract query into a payload good to go to Data API, its input could be required to be not a dictionary (which arguably makes usage more error-prone), rather a specific structure: instances of some AstraDBDocumentQuery class, to be then made into dictionaries only with the knowledge of the encoding scheme.

I can probably find some time to rework this PR in this sense tomorrow - would that capture the essence of your remarks? (plus helping with clarity since it requires data classes to express the abstractness of queries? Or perhaps too unwieldy?)

bjchambers · 2025-02-17T15:24:23Z

I like that direction and the observation it doesn't all need to be in a dict. That will keep the difference between id and metadata["id"] clear too.

I wonder if a data class is necessary. Could it just be:

def encode_query(self, ids: Iterable[str | int] = (), metadata: dict[str, Any] = {}):

I think keeping vector separate (it goes to sort, not the filter) could be reasonable.

hemidactylus · 2025-02-17T16:14:11Z

Right, and the metadata would be allowed to be anything (i.e. nested AND, OR, whatever). The current mechanism, which rewrites a key with the prefix unless it is a $-operator, would be enough. It is not a codec's responsibility to split filters if they're too bulky.

Also agree that at this point the data class becomes useless weight. You convinced me: no data class :)

hemidactylus · 2025-02-17T23:55:38Z

I have replaced the "encode_id[s]" methods with an encode_query along the lines we discussed.

Some notes:

encode_query assumes the IDs are in AND with the metadata conditions (if both are passed). (I think it is unlikely that one wants an OR between those; in that case, running multiple queries and merging the results would be the way to go, I believe.)
the second parameter is named filter_dict for compatibility with the name used throughout the VectorStore class, where it always means "metadata filters"
ids need not be typed as str | int: in LangChain, IDs are always strings, I believe

I have adapted the unit tests (test_vs_doc_codecs.py, tests test_flat/default_query_encoding). Below is a summary of the wonders of encode_query, for your convenience:

from langchain_astradb.utils.vector_store_codecs import _DefaultVSDocumentCodec, _FlatVSDocumentCodec
d = _DefaultVSDocumentCodec(content_field='c', ignore_invalid_documents=False)
f = _FlatVSDocumentCodec(content_field='c', ignore_invalid_documents=False)


d.encode_query()
f.encode_query()
# both: {}


d.encode_query(ids=['id1'])
f.encode_query(ids=['id1'])
# both: {'_id': 'id1'}


d.encode_query(ids=['id1', 'id2'])
f.encode_query(ids=['id1', 'id2'])
# both: {'_id': {'$in': ['id1', 'id2']}}


d.encode_query(ids=['d'],filter_dict={'x':'y'})
f.encode_query(ids=['d'],filter_dict={'x':'y'})
# resp.:
#   {'$and': [{'_id': 'd'}, {'metadata.x': 'y'}]}
#   {'$and': [{'_id': 'd'}, {'x': 'y'}]}


d.encode_query(ids=['d'],filter_dict={'x':'y','z':'w'})
f.encode_query(ids=['d'],filter_dict={'x':'y','z':'w'})
# resp.:
#   {'$and': [{'_id': 'd'}, {'metadata.x': 'y', 'metadata.z': 'w'}]}
#   {'$and': [{'_id': 'd'}, {'x': 'y', 'z': 'w'}]}


d.encode_query(ids=['d'],filter_dict={'$or':[{'x':'y'},{'z':'w'}]})
f.encode_query(ids=['d'],filter_dict={'$or':[{'x':'y'},{'z':'w'}]})
# resp.:
#   {'$and': [{'_id': 'd'}, {'$or': [{'metadata.x': 'y'}, {'metadata.z': 'w'}]}]}
#   {'$and': [{'_id': 'd'}, {'$or': [{'x': 'y'}, {'z': 'w'}]}]}
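For reference, the behaviour summarized above can be reproduced by a standalone function such as the following sketch. It is an assumption-laden re-creation, not the method from this PR: the function name, the helper, and the metadata_prefix parameter (standing in for the default vs. flat codec difference) are all hypothetical.

```python
from __future__ import annotations

from collections.abc import Iterable
from typing import Any


def _prefix_metadata(node: Any, prefix: str) -> Any:
    """Prefix plain keys; keep $-operators and recurse into their operands."""
    if isinstance(node, list):
        return [_prefix_metadata(item, prefix) for item in node]
    if isinstance(node, dict):
        return {
            (key if key.startswith("$") else prefix + key): (
                _prefix_metadata(value, prefix) if key.startswith("$") else value
            )
            for key, value in node.items()
        }
    return node


def encode_query_sketch(
    *,
    ids: Iterable[str] | None = None,
    filter_dict: dict[str, Any] | None = None,
    metadata_prefix: str = "metadata.",
) -> dict[str, Any]:
    """Combine an id clause and a metadata clause with an implicit $and."""
    clauses: list[dict[str, Any]] = []
    id_list = list(ids or [])
    if len(id_list) == 1:
        clauses.append({"_id": id_list[0]})
    elif len(id_list) > 1:
        clauses.append({"_id": {"$in": id_list}})
    if filter_dict:
        clauses.append(_prefix_metadata(filter_dict, metadata_prefix))
    if not clauses:
        return {}
    if len(clauses) > 1:
        return {"$and": clauses}
    return clauses[0]
```

Calling it with metadata_prefix="" mimics the flat codec's outputs in the summary above; the default mimics the nested one.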

hemidactylus · 2025-02-17T23:56:31Z

the difference between id and metadata["id"]

(an incidental note is that trying to use metadata._id as a literal metadata field may stop working as soon as one has a "flat" vector store - for reasons unrelated to this PR, more fundamental. I believe however that the code should not throw an error - suppose there is a legacy vectorstore out there that uses such a metadata field, created prior to the introduction of flat codecs and autodetect. Certainly not something to encourage, ...)

bjchambers · 2025-02-18T00:01:04Z

libs/astradb/langchain_astradb/utils/vector_store_codecs.py

+        *,
+        ids: Iterable[str] | None = None,
+        filter_dict: dict[str, Any] | None = None,
+    ) -> dict[str, Any]:


May merit pydoc indicating the implicit $and.

bjchambers · 2025-02-18T00:02:19Z

libs/astradb/langchain_astradb/utils/vector_store_codecs.py

+
+        if clauses:
+            if len(clauses) > 1:
+                return {"$and": clauses}


Agreed this makes sense. In general, my rationale is:

If you want ids OR filter, then run separate queries -- the IDs only query is likely to be fast and the filter-only query is likely to do a scan.

If you want ids AND filter then there is no option beyond running them together (unless you somehow emulate the full filtering semantics on the client side).

So, it seems like the AND is the only reasonable choice.
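For completeness, the "ids OR filter" case via separate queries could be handled on the client roughly as sketched here. The find callable is a hypothetical stand-in for whatever runs an already-encoded filter and yields stored documents; nothing here is part of the PR.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable
from typing import Any

Doc = dict[str, Any]


def find_ids_or_filter(
    find: Callable[[dict[str, Any]], Iterable[Doc]],
    ids: Iterable[str],
    encoded_filter: dict[str, Any],
) -> list[Doc]:
    """Emulate 'ids OR filter' with two queries merged by document id."""
    merged: dict[str, Doc] = {}
    # The ids-only query goes first (expected to be the fast one),
    # then the filter-only query; duplicates are dropped by "_id".
    for query in ({"_id": {"$in": list(ids)}}, encoded_filter):
        for doc in find(query):
            merged.setdefault(doc["_id"], doc)
    return list(merged.values())
```

The merge keeps the first copy of each document, so results come back in the order ids-query hits, then any extra filter-query hits.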

Stefano Lottini added 2 commits February 13, 2025 15:07

centralize codec's id/vector encoding; add multi-ids encoding

59c712c

move all indexing, _id and similarity management into codecs so that …

d92c15e

…it cleanly passes through the coded layer all the time

hemidactylus requested a review from epinzur February 13, 2025 16:54

hemidactylus changed the title ~~Sl vs full delegate to codec~~ Vector Store, full separation codec / vectorstore Feb 14, 2025

bjchambers reviewed Feb 14, 2025

hemidactylus added the do_not_merge Do not merge yet, requires further discussion label Feb 17, 2025

trading encode_id[s] for encode_query

1797685

bjchambers approved these changes Feb 18, 2025

them docstrings

1112af0

hemidactylus removed the do_not_merge Do not merge yet, requires further discussion label Feb 18, 2025

hemidactylus merged commit 3715b62 into main Feb 18, 2025
13 checks passed

hemidactylus deleted the SL-vs-full-delegate-to-codec branch February 18, 2025 22:12

hemidactylus mentioned this pull request Feb 21, 2025

Vector store's run_query method #114

Merged
